[doc] refactor: split technique guidance into docs, skills, and knowledge#2809
[doc] refactor: split technique guidance into docs, skills, and knowledge#2809
Conversation
Clarify which training guidance is stable documentation versus operational enablement so technique knowledge can be maintained without duplicated docs. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test e397f32 |
Add docs/skills/knowledge split for parallelism strategy selection, based on real recipe layouts from DeepSeek-V2/V3, Qwen3, Kimi-K2, etc. Key insight: MoE models size TP by active params (often TP=1-2) and use EP as the primary scaling dimension, unlike dense models. Verified TP=2+PP=2+SP on cluster with llama32_1b_pretrain_config. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
Move perf-technique skills from flat .md files to subfolders with SKILL.md + card.yaml, consolidating knowledge/techniques/ YAML cards alongside their skills. Add resiliency skill and card. Refactor docs/training/resiliency.md to be concise conceptual docs, offloading operational details to skills/resiliency/SKILL.md. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
Add stable docs page, operational SKILL.md, and machine-readable card.yaml for CUDA graph capture in Megatron Bridge. Update performance-guide.md to reference current cuda_graph_impl/scope config fields instead of the deprecated enable_cuda_graph. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test 5d8044c |
📝 WalkthroughWalkthroughDocumentation restructuring and expansion across training guides and knowledge organization. Consolidates verbose operational guidance into stable, concise overviews in docs/training/ while establishing a new skills/perf-techniques/ directory with structured SKILL.md and card.yaml pairs for performance techniques and resiliency features. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 19
🧹 Nitpick comments (3)
skills/perf-techniques/cuda-graphs/SKILL.md (1)
217-221: Consider hyphenating compound modifier.Static analysis suggests "TE-scoped" for clarity when used as a compound adjective before a noun.
📝 Optional fix
4. **Memory overhead**: CUDA graphs pin all intermediate buffers for the - graph's lifetime (no memory reuse). TE scoped graphs add a few GB; + graph's lifetime (no memory reuse). TE-scoped graphs add a few GB; full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1` compounds overhead since each stage holds its own graph.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/cuda-graphs/SKILL.md` around lines 217 - 221, Change the compound modifier "TE scoped graphs" to a hyphenated form "TE-scoped graphs" in the paragraph discussing memory overhead (the line that currently reads "TE scoped graphs add a few GB;"), so the compound adjective is clear when used before the noun; update the occurrence(s) of "TE scoped" to "TE-scoped" to maintain consistent style.skills/resiliency/SKILL.md (1)
263-265: Consider more concise phrasing (optional).Minor style suggestion from static analysis: "is not compatible with" could be "is incompatible with" for conciseness.
📝 Optional fix
-3. **IPR + NeMo-Run**: In-process restart is not compatible with NeMo-Run +3. **IPR + NeMo-Run**: In-process restart is incompatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/resiliency/SKILL.md` around lines 263 - 265, Replace the phrase "is not compatible with" with the more concise "is incompatible with" in the bullet that begins "IPR + NeMo-Run" so the line reads "IPR + NeMo-Run: In-process restart is incompatible with NeMo-Run or Slurm preemption plugins." Keep the rest of the sentence (the note about specific PyTorch/NCCL versions and env vars) unchanged; update only the quoted wording in the SKILL.md bullet under "IPR + NeMo-Run".skills/perf-techniques/sequence-packing/card.yaml (1)
39-50: Avoid pseudo-values in this machine-readable block.
match_seq_length,">1", and2 * context_parallel_sizeare prose placeholders inside scalar YAML values. That makesrecommended_pathhard for tooling to consume consistently and undercuts the machine-readable-card goal. Consider moving these into explicit constraint fields or structured metadata instead.As per coding guidelines, "Do not add arbitrary defaults for configs, be as explicit as possible".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/sequence-packing/card.yaml` around lines 39 - 50, The YAML uses prose placeholders inside scalar values (e.g., recommended_path.llm_packed_sft.dataset.packed_sequence_specs.packed_sequence_size: match_seq_length, recommended_path.vlm_in_batch_packing.train.micro_batch_size: ">1", and recommended_path.cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult: 2 * context_parallel_size); replace these pseudo-values with explicit, machine-readable fields or structured constraints: for example add a constraint object like packed_sequence_specs.packed_sequence_size_mode: "match_seq_length" or packed_sequence_specs.packed_sequence_size_constraint: {mode: "match_seq_length"}, replace the string range with a min/max field such as vlm_in_batch_packing.train.micro_batch_size_min: 2, and replace arithmetic with a factor field such as cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult_factor: 2 plus a reference name context_parallel_size_variable so tooling can compute pad_seq_to_mult at runtime—update llm_packed_sft, cp_finetuning, and vlm_in_batch_packing accordingly so all values are explicit and machine-readable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/parallelisms.md`:
- Around line 508-512: Update the broken cross-reference in
docs/parallelisms.md: replace the non-existent
../knowledge/techniques/parallelism_strategies.yaml link with the relocated card
path under skills/perf-techniques (e.g.,
../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge
card" link in the "Related Artifacts" section points to the correct card.yaml;
edit the link text that currently references parallelism_strategies.yaml and
ensure it matches the directory name used by the existing skill file
(parallelism-strategies) and the card.yaml filename.
In `@docs/training/communication-overlap.md`:
- Around line 12-13: Update the stale skill file references that point to
pre-split markdown files by changing them to the new per-technique SKILL.md
layout: replace `skills/perf-techniques/tp-dp-comm-overlap.md` with
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace
`skills/perf-techniques/expert-parallel-overlap.md` with
`skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same
path updates to the other occurrences noted (101-102) so all references point to
the new SKILL.md locations.
In `@docs/training/cuda-graphs.md`:
- Around line 45-56: The docs use two different RNG-tracker
names—cfg.rng.te_rng_tracker in the example and use_te_rng_tracker in the
constraints—so pick one and make them consistent; update the constraint line to
refer to the same symbol used in the example (cfg.rng.te_rng_tracker) or
conversely change the example to use use_te_rng_tracker, and ensure the
constraint referencing cuda_graph_impl still reads "cfg.rng.te_rng_tracker must
be True when cfg.model.cuda_graph_impl != 'none'".
- Line 1: The new document docs/training/cuda-graphs.md is orphaned; update the
relevant toctree by adding a reference to "training/cuda-graphs.md" in the docs
index—specifically edit docs/index.md (or the appropriate docs/training index)
and insert the relative path or title entry into the toctree/list so the page is
included in the site navigation; ensure the entry matches the existing toctree
style and run a local docs build to verify the warning is resolved.
In `@docs/training/hybrid-context-parallel.md`:
- Around line 7-8: Update the opening pointer string that currently references
"skills/perf-techniques/hybrid-context-parallel.md" to use the new layout path
"skills/perf-techniques/hybrid-context-parallel/SKILL.md" so the link matches
the rest of the file; search for the literal reference
"skills/perf-techniques/hybrid-context-parallel.md" and replace it with
"skills/perf-techniques/hybrid-context-parallel/SKILL.md".
- Around line 1-8: The new doc "Hybrid / Hierarchical Context Parallel"
(hybrid-context-parallel.md) is not listed in any Sphinx toctree and will be
orphaned; update the relevant toctree (e.g., add an entry in docs/index.md or
the training toctree that aggregates docs/training/*.md) to include
"training/hybrid-context-parallel.md" so the page is reachable in the built docs
and the toc.not_included warning is resolved.
In `@docs/training/megatron-fsdp.md`:
- Line 10: The docs reference the old path
"skills/perf-techniques/megatron-fsdp.md" which is stale; update all internal
links in docs/training/megatron-fsdp.md that point to
"skills/perf-techniques/megatron-fsdp.md" (including the occurrences around the
top and near line 100) to the new path
"skills/perf-techniques/megatron-fsdp/SKILL.md" so internal navigation works;
ensure you replace every exact match of the old filename string with the new
"SKILL.md" path.
In `@docs/training/packed-sequences.md`:
- Around line 8-10: The docs reference
'skills/perf-techniques/sequence-packing.md' which no longer exists; update
every occurrence (including the one at lines ~83-89) to point to the relocated
guide 'skills/perf-techniques/sequence-packing/SKILL.md' so links resolve
correctly; search the file for the string "sequence-packing.md" and replace it
with "sequence-packing/SKILL.md" in the content and any link markup (the
affected file is docs/training/packed-sequences.md).
In `@docs/training/resiliency.md`:
- Around line 15-24: The table entry for "Local checkpointing" is missing its
corresponding documentation sections; add a matching "Local checkpointing"
feature block with "When to Use It" and "Stable Constraints" (and any other
sections used for other features like motivation/usage) following the exact
structure and tone used for entries such as "Async checkpoint save" and
"Preemption", or if this feature should not be documented yet, remove the "Local
checkpointing" row from the table to keep the table consistent; locate
references to "Local checkpointing" in the doc and implement the new sections
with clear guidance on motivation, technical approach, architecture, usage
examples, and implementation details to satisfy the project documentation
requirements.
In `@skills/perf-techniques/expert-parallel-overlap/card.yaml`:
- Around line 27-33: Update the card's known_constraints to mirror the runtime
checks in src/megatron/bridge/training/comm_overlap.py: add a constraint that
moe_shared_expert_overlap must be disabled/false (referencing the
moe_shared_expert_overlap check), require torch version >= 2.6.0 (referencing
the torch version check), and enumerate the recompute-related constraints
enforced in the file (recompute_granularity, recompute_method,
recompute_num_layers) plus the mtp_num_layers constraint so configurations
failing those checks are rejected by the card; ensure each constraint references
the exact setting names used in the code (moe_shared_expert_overlap,
recompute_granularity, recompute_method, recompute_num_layers, mtp_num_layers,
torch version) and phrase them consistently with the existing items.
In `@skills/perf-techniques/expert-parallel-overlap/SKILL.md`:
- Around line 106-110: Update the "Success criteria" in SKILL.md to remove the
hard-coded "41 passed" count and instead require that the targeted test files
pass with zero failures; specifically replace the "41 passed" bullet with a
statement that tests under test_comm_overlap.py (MoE overlap and delayed-wgrad
validation) and test_deepep.py (DeepEP/HybridEP helper activation and GPU
gating) must pass with no failures, so the criteria track those files rather
than a total pytest number.
- Around line 41-49: Add the missing hard requirement that
moe_shared_expert_overlap = False to the "Required constraints" list because the
code asserts this when overlap_moe_expert_parallel_comm is enabled; update
SKILL.md's constraints block to include the exact constraint
"moe_shared_expert_overlap = False" alongside the other items and mention it
applies whenever overlap_moe_expert_parallel_comm is true so readers know it's
not optional.
In `@skills/perf-techniques/hybrid-context-parallel/card.yaml`:
- Around line 51-60: The evidence list in card.yaml references five files that
don't exist; either add the missing files (create
scripts/verify_hybrid_context_parallel.py and the four Megatron-LM stubs under
3rdparty/Megatron-LM/megatron/core and megatron/training) or remove those
entries from the evidence array to keep the card accurate; locate the evidence
key in skills/perf-techniques/hybrid-context-parallel/card.yaml and update it so
only existing files (src/megatron/bridge/training/initialize.py, config.py,
model_load_save.py, and docs/performance-guide.md) remain or add the missing
files with the expected verification script and Megatron modules.
In `@skills/perf-techniques/hybrid-context-parallel/SKILL.md`:
- Around line 94-103: The documentation references a missing verification script
scripts/verify_hybrid_context_parallel.py in SKILL.md which causes "No such
file" errors; either add the missing script or remove/update the verification
commands. Locate the SKILL.md section that runs the verification (the
CUDA_VISIBLE_DEVICES... uv run python -m torch.distributed.run commands) and (a)
if the verification logic exists elsewhere, point the commands to the correct
script/module name and update the examples, or (b) if the verification is
intended to be provided, implement scripts/verify_hybrid_context_parallel.py
matching the CLI flags (--mode mpu/decentralized) and expected behavior, or (c)
remove the verification subsection entirely; update the README references and
any mentions of verify_hybrid_context_parallel.py accordingly.
In `@skills/perf-techniques/megatron-fsdp/card.yaml`:
- Around line 3-6: The torch_fsdp2_runtime status in the card.yaml is
inconsistent with the documented failure in known_limitations; update the
torch_fsdp2_runtime field (in skills/perf-techniques/megatron-fsdp/card.yaml)
from "unclear" to an explicit status reflecting the runtime failure (e.g.,
"failed" or "runtime_failure") and ensure the value aligns with the
known_limitations note about the Torch FSDP2 TypeError from pg_collection so
readers see the runtime failure state consistently.
In `@skills/perf-techniques/megatron-fsdp/SKILL.md`:
- Around line 27-35: The example config block for Megatron FSDP is missing the
cfg.ddp.average_in_collective = False override and thus diverges from the
recommended minimal override and perf-harness overrides; update the snippet that
sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp,
cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format,
cfg.checkpoint.save, and cfg.checkpoint.load to also set
cfg.ddp.average_in_collective = False so the copy-paste example matches the
recommended override set.
In `@skills/perf-techniques/parallelism-strategies/SKILL.md`:
- Around line 207-208: The cross-reference path in the paragraph mentioning
`skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that
reference to the correct target
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the
existing SKILL.md file (edit the occurrence of
`skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file
and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).
In `@skills/perf-techniques/sequence-packing/SKILL.md`:
- Around line 39-44: The doc block must mention that enabling pad_cu_seqlens
requires packed-sequence metadata to be present; update the text to explicitly
state that when cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must
supply packed-sequence metadata for your dataset (or disable pad_cu_seqlens),
because the dataset code (see src.megatron.bridge.data.datasets.sft assertion in
the packed-sequence handling) will assert that metadata exists; reference
pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the
assertion in the sft dataset so users with custom packed datasets know to
generate/provide the required packed metadata.
In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml`:
- Around line 31-36: Add a new known_constraints entry stating that TP overlap
requires Transformer Engine to be present; specifically mention that TP overlap
is disabled when HAVE_TE is false (see the gating in comm_overlap where the code
checks elif not HAVE_TE), so include "TP overlap requires Transformer Engine
(HAVE_TE)". Keep the same YAML list style and place it alongside the existing TP
overlap constraints referencing tensor_model_parallel_size and
sequence_parallel.
---
Nitpick comments:
In `@skills/perf-techniques/cuda-graphs/SKILL.md`:
- Around line 217-221: Change the compound modifier "TE scoped graphs" to a
hyphenated form "TE-scoped graphs" in the paragraph discussing memory overhead
(the line that currently reads "TE scoped graphs add a few GB;"), so the
compound adjective is clear when used before the noun; update the occurrence(s)
of "TE scoped" to "TE-scoped" to maintain consistent style.
In `@skills/perf-techniques/sequence-packing/card.yaml`:
- Around line 39-50: The YAML uses prose placeholders inside scalar values
(e.g.,
recommended_path.llm_packed_sft.dataset.packed_sequence_specs.packed_sequence_size:
match_seq_length, recommended_path.vlm_in_batch_packing.train.micro_batch_size:
">1", and
recommended_path.cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult: 2
* context_parallel_size); replace these pseudo-values with explicit,
machine-readable fields or structured constraints: for example add a constraint
object like packed_sequence_specs.packed_sequence_size_mode: "match_seq_length"
or packed_sequence_specs.packed_sequence_size_constraint: {mode:
"match_seq_length"}, replace the string range with a min/max field such as
vlm_in_batch_packing.train.micro_batch_size_min: 2, and replace arithmetic with
a factor field such as
cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult_factor: 2 plus a
reference name context_parallel_size_variable so tooling can compute
pad_seq_to_mult at runtime—update llm_packed_sft, cp_finetuning, and
vlm_in_batch_packing accordingly so all values are explicit and
machine-readable.
In `@skills/resiliency/SKILL.md`:
- Around line 263-265: Replace the phrase "is not compatible with" with the more
concise "is incompatible with" in the bullet that begins "IPR + NeMo-Run" so the
line reads "IPR + NeMo-Run: In-process restart is incompatible with NeMo-Run or
Slurm preemption plugins." Keep the rest of the sentence (the note about
specific PyTorch/NCCL versions and env vars) unchanged; update only the quoted
wording in the SKILL.md bullet under "IPR + NeMo-Run".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d69fecaa-5a5e-4c83-8fbc-5de00f65b2bf
📒 Files selected for processing (27)
docs/parallelisms.mddocs/performance-guide.mddocs/training/README.mddocs/training/communication-overlap.mddocs/training/cuda-graphs.mddocs/training/hybrid-context-parallel.mddocs/training/megatron-fsdp.mddocs/training/packed-sequences.mddocs/training/resiliency.mdknowledge/techniques/README.mdskills/perf-techniques/README.mdskills/perf-techniques/cuda-graphs/SKILL.mdskills/perf-techniques/cuda-graphs/card.yamlskills/perf-techniques/expert-parallel-overlap/SKILL.mdskills/perf-techniques/expert-parallel-overlap/card.yamlskills/perf-techniques/hybrid-context-parallel/SKILL.mdskills/perf-techniques/hybrid-context-parallel/card.yamlskills/perf-techniques/megatron-fsdp/SKILL.mdskills/perf-techniques/megatron-fsdp/card.yamlskills/perf-techniques/parallelism-strategies/SKILL.mdskills/perf-techniques/parallelism-strategies/card.yamlskills/perf-techniques/sequence-packing/SKILL.mdskills/perf-techniques/sequence-packing/card.yamlskills/perf-techniques/tp-dp-comm-overlap/SKILL.mdskills/perf-techniques/tp-dp-comm-overlap/card.yamlskills/resiliency/SKILL.mdskills/resiliency/card.yaml
| ## Related Artifacts | ||
|
|
||
| - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification | ||
| - **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status | ||
|
|
There was a problem hiding this comment.
Fix broken cross-reference to knowledge card.
The pipeline failure indicates the path ../knowledge/techniques/parallelism_strategies.yaml does not exist. Based on the README at knowledge/techniques/README.md, cards have been relocated to skills/perf-techniques/<technique>/card.yaml. Update the reference to point to the correct location.
🔧 Proposed fix
## Related Artifacts
- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
-- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status
+- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## Related Artifacts | |
| - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification | |
| - **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status | |
| ## Related Artifacts | |
| - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification | |
| - **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status |
🧰 Tools
🪛 GitHub Actions: Build docs
[warning] 511-511: Sphinx: cross-reference target not found: '../knowledge/techniques/parallelism_strategies.yaml' [myst.xref_missing]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/parallelisms.md` around lines 508 - 512, Update the broken
cross-reference in docs/parallelisms.md: replace the non-existent
../knowledge/techniques/parallelism_strategies.yaml link with the relocated card
path under skills/perf-techniques (e.g.,
../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge
card" link in the "Related Artifacts" section points to the correct card.yaml;
edit the link text that currently references parallelism_strategies.yaml and
ensure it matches the directory name used by the existing skill file
(parallelism-strategies) and the card.yaml filename.
| - `skills/perf-techniques/tp-dp-comm-overlap.md` | ||
| - `skills/perf-techniques/expert-parallel-overlap.md` |
There was a problem hiding this comment.
Update these skill references to the new .../SKILL.md layout.
These pointers still use the pre-split skills/perf-techniques/*.md paths, but this PR is moving skills into per-technique directories with SKILL.md/card.yaml pairs. At least skills/perf-techniques/expert-parallel-overlap/SKILL.md already follows that layout, so the current file references will be stale immediately.
📝 Suggested path fix
- `skills/perf-techniques/tp-dp-comm-overlap.md`
- `skills/perf-techniques/expert-parallel-overlap.md`
+ `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`
+ `skills/perf-techniques/expert-parallel-overlap/SKILL.md`
...
- `skills/perf-techniques/tp-dp-comm-overlap.md`
- `skills/perf-techniques/expert-parallel-overlap.md`
+ `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`
+ `skills/perf-techniques/expert-parallel-overlap/SKILL.md`Also applies to: 101-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/training/communication-overlap.md` around lines 12 - 13, Update the
stale skill file references that point to pre-split markdown files by changing
them to the new per-technique SKILL.md layout: replace
`skills/perf-techniques/tp-dp-comm-overlap.md` with
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace
`skills/perf-techniques/expert-parallel-overlap.md` with
`skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same
path updates to the other occurrences noted (101-102) so all references point to
the new SKILL.md locations.
| @@ -0,0 +1,84 @@ | |||
| # CUDA Graphs | |||
There was a problem hiding this comment.
Add this page to the docs toctree.
The docs build is already warning that this page is not included in any toctree, so the new document is currently orphaned and easy to miss. Please wire it into the relevant docs index/toctree before merging.
As per coding guidelines, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/training/cuda-graphs.md` at line 1, The new document
docs/training/cuda-graphs.md is orphaned; update the relevant toctree by adding
a reference to "training/cuda-graphs.md" in the docs index—specifically edit
docs/index.md (or the appropriate docs/training index) and insert the relative
path or title entry into the toctree/list so the page is included in the site
navigation; ensure the entry matches the existing toctree style and run a local
docs build to verify the warning is resolved.
| ```python | ||
| cfg.model.cuda_graph_impl = "transformer_engine" # or "local" | ||
| cfg.model.cuda_graph_scope = ["attn", "moe_router"] # scope list | ||
| cfg.model.cuda_graph_warmup_steps = 3 # warmup before capture | ||
| cfg.rng.te_rng_tracker = True # required | ||
| ``` | ||
|
|
||
| ### Key constraints | ||
|
|
||
| - `use_te_rng_tracker` must be `True` when `cuda_graph_impl != "none"`. | ||
| - `full_iteration` scope requires `cuda_graph_impl = "local"` and | ||
| `rerun_state_machine.check_for_nan_in_loss = False`. |
There was a problem hiding this comment.
Use one RNG-tracker setting name here.
The example uses cfg.rng.te_rng_tracker, but the constraint list switches to use_te_rng_tracker. One of those names is wrong, so the setup instructions are currently self-contradictory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/training/cuda-graphs.md` around lines 45 - 56, The docs use two
different RNG-tracker names—cfg.rng.te_rng_tracker in the example and
use_te_rng_tracker in the constraints—so pick one and make them consistent;
update the constraint line to refer to the same symbol used in the example
(cfg.rng.te_rng_tracker) or conversely change the example to use
use_te_rng_tracker, and ensure the constraint referencing cuda_graph_impl still
reads "cfg.rng.te_rng_tracker must be True when cfg.model.cuda_graph_impl !=
'none'".
| # Hybrid / Hierarchical Context Parallel | ||
|
|
||
| This page covers the stable Bridge-facing meaning of hierarchical context | ||
| parallelism, especially the `a2a+p2p` transport path and | ||
| `hierarchical_context_parallel_sizes`. | ||
|
|
||
| For operational setup, code anchors, and verification commands, see | ||
| `skills/perf-techniques/hybrid-context-parallel.md`. |
There was a problem hiding this comment.
Add this new page to a Sphinx toctree before merging.
The docs pipeline is already warning document isn't included in any toctree [toc.not_included], so this page will be orphaned in the built docs until docs/index.md or the relevant training toctree includes it. Based on learnings, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/training/hybrid-context-parallel.md` around lines 1 - 8, The new doc
"Hybrid / Hierarchical Context Parallel" (hybrid-context-parallel.md) is not
listed in any Sphinx toctree and will be orphaned; update the relevant toctree
(e.g., add an entry in docs/index.md or the training toctree that aggregates
docs/training/*.md) to include "training/hybrid-context-parallel.md" so the page
is reachable in the built docs and the toc.not_included warning is resolved.
| summary: > | ||
| Megatron FSDP is the practical FSDP path in Megatron-Bridge today. PyTorch | ||
| FSDP2 exists in code, but remains experimental and failed at runtime during | ||
| live validation on the current branch. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat skills/perf-techniques/megatron-fsdp/card.yamlRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2172
Update torch_fsdp2_runtime status to reflect the documented runtime failure.
The field is marked unclear but known_limitations documents a specific FSDP2 runtime failure: "Live validation of the current branch hit a Torch FSDP2 runtime TypeError from pg_collection." Change the status from unclear to explicitly reflect this failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/perf-techniques/megatron-fsdp/card.yaml` around lines 3 - 6, The
torch_fsdp2_runtime status in the card.yaml is inconsistent with the documented
failure in known_limitations; update the torch_fsdp2_runtime field (in
skills/perf-techniques/megatron-fsdp/card.yaml) from "unclear" to an explicit
status reflecting the runtime failure (e.g., "failed" or "runtime_failure") and
ensure the value aligns with the known_limitations note about the Torch FSDP2
TypeError from pg_collection so readers see the runtime failure state
consistently.
| ```python | ||
| cfg = llama3_8b_pretrain_config() | ||
| cfg.dist.use_megatron_fsdp = True | ||
| cfg.ddp.use_megatron_fsdp = True | ||
| cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params" | ||
| cfg.checkpoint.ckpt_format = "fsdp_dtensor" | ||
| cfg.checkpoint.save = "/tmp/fsdp_ckpts" | ||
| cfg.checkpoint.load = None | ||
| ``` |
There was a problem hiding this comment.
Keep the copy-paste recipe aligned with the recommended override set.
The example on Lines 27-35 drops cfg.ddp.average_in_collective = False, even though the minimal override above and the perf-harness overrides on Lines 91-98 both include it. As written, users who copy the example end up testing a different configuration than the one this skill recommends.
Suggested fix
cfg = llama3_8b_pretrain_config()
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
+cfg.ddp.average_in_collective = False
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
cfg.checkpoint.save = "/tmp/fsdp_ckpts"
cfg.checkpoint.load = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/perf-techniques/megatron-fsdp/SKILL.md` around lines 27 - 35, The
example config block for Megatron FSDP is missing the
cfg.ddp.average_in_collective = False override and thus diverges from the
recommended minimal override and perf-harness overrides; update the snippet that
sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp,
cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format,
cfg.checkpoint.save, and cfg.checkpoint.load to also set
cfg.ddp.average_in_collective = False so the copy-paste example matches the
recommended override set.
| 7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with | ||
| overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the correct path for tp-dp-comm-overlap documentation
fd -t f "SKILL.md" skills/perf-techniques/tp-dp-comm-overlap/
fd -t f "tp-dp-comm-overlap" skills/Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 121
Fix incorrect cross-reference path.
Line 208 references skills/perf-techniques/tp-dp-comm-overlap.md, but the file is located at skills/perf-techniques/tp-dp-comm-overlap/SKILL.md.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/perf-techniques/parallelism-strategies/SKILL.md` around lines 207 -
208, The cross-reference path in the paragraph mentioning
`skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that
reference to the correct target
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the
existing SKILL.md file (edit the occurrence of
`skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file
and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).
| If CUDA graphs are enabled for this packed path: | ||
|
|
||
| ```python | ||
| cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True | ||
| cfg.dataset.dataset_kwargs["pad_to_max_length"] = True | ||
| ``` |
There was a problem hiding this comment.
Call out the metadata prerequisite for pad_cu_seqlens.
This block currently implies those two flags are sufficient, but src/megatron/bridge/data/datasets/sft.py:765-772 also asserts that packed-sequence metadata is present when pad_cu_seqlens=True. Without that note, users with custom packed datasets will hit an assertion after following this section.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/perf-techniques/sequence-packing/SKILL.md` around lines 39 - 44, The
doc block must mention that enabling pad_cu_seqlens requires packed-sequence
metadata to be present; update the text to explicitly state that when
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must supply
packed-sequence metadata for your dataset (or disable pad_cu_seqlens), because
the dataset code (see src.megatron.bridge.data.datasets.sft assertion in the
packed-sequence handling) will assert that metadata exists; reference
pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the
assertion in the sft dataset so users with custom packed datasets know to
generate/provide the required packed metadata.
| known_constraints: | ||
| - TP overlap requires tensor_model_parallel_size > 1. | ||
| - TP overlap requires sequence_parallel=True. | ||
| - DP overlap is tied to the distributed-optimizer path. | ||
| - PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup. | ||
| - Launch-time environment tuning is part of practical overlap behavior. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find the comm_overlap.py file and check around lines 421-465
find . -type f -name "comm_overlap.py" | head -5Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 117
🏁 Script executed:
# Find the card.yaml file mentioned in the review
find . -type f -path "*/tp-dp-comm-overlap/card.yaml"Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 124
🏁 Script executed:
# Check if the line range 421-465 exists and what's in it
if [ -f "src/megatron/bridge/training/comm_overlap.py" ]; then
wc -l src/megatron/bridge/training/comm_overlap.py
echo "=== Lines 421-465 ==="
sed -n '421,465p' src/megatron/bridge/training/comm_overlap.py
fiRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2311
🏁 Script executed:
# Search for Transformer Engine checks in the comm_overlap.py file
rg "TransformerEngine|TE|transformer.engine" src/megatron/bridge/training/comm_overlap.py -A 3 -B 3 || echo "No matches found"Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2366
🏁 Script executed:
# Check the card.yaml file content to see the current known_constraints
cat src/megatron/bridge/skills/perf-techniques/tp-dp-comm-overlap/card.yamlRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 174
🏁 Script executed:
# Use the correct path found earlier
cat ./skills/perf-techniques/tp-dp-comm-overlap/card.yamlRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2113
Add the Transformer Engine prerequisite to known_constraints.
The card marks TP overlap gating as code-verified, but known_constraints omits a gate that the code enforces. In src/megatron/bridge/training/comm_overlap.py:444-446, TP overlap is disabled when Transformer Engine is unavailable (elif not HAVE_TE). This constraint should be documented alongside the tensor-parallel size and sequence-parallel requirements.
📝 Suggested addition
known_constraints:
- TP overlap requires tensor_model_parallel_size > 1.
- TP overlap requires sequence_parallel=True.
+ - TP overlap requires Transformer Engine to be available.
- DP overlap is tied to the distributed-optimizer path.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| known_constraints: | |
| - TP overlap requires tensor_model_parallel_size > 1. | |
| - TP overlap requires sequence_parallel=True. | |
| - DP overlap is tied to the distributed-optimizer path. | |
| - PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup. | |
| - Launch-time environment tuning is part of practical overlap behavior. | |
| known_constraints: | |
| - TP overlap requires tensor_model_parallel_size > 1. | |
| - TP overlap requires sequence_parallel=True. | |
| - TP overlap requires Transformer Engine to be available. | |
| - DP overlap is tied to the distributed-optimizer path. | |
| - PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup. | |
| - Launch-time environment tuning is part of practical overlap behavior. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml` around lines 31 - 36,
Add a new known_constraints entry stating that TP overlap requires Transformer
Engine to be present; specifically mention that TP overlap is disabled when
HAVE_TE is false (see the gating in comm_overlap where the code checks elif not
HAVE_TE), so include "TP overlap requires Transformer Engine (HAVE_TE)". Keep
the same YAML list style and place it alongside the existing TP overlap
constraints referencing tensor_model_parallel_size and sequence_parallel.
…h HCP docs - Add skills/perf-techniques/moe-comm-overlap/ (SKILL.md + card.yaml) - Add skills/perf-techniques/packed-sequences-long-context/ (SKILL.md + card.yaml) - Enrich hybrid-context-parallel SKILL.md with implementation map and pitfalls - Move conceptual content (what/when) from skills into docs per source-of-truth split: - docs/training/hybrid-context-parallel.md: CP backends table, KV-head limit, decision tree - docs/training/communication-overlap.md: MoE dispatch/combine overlap details - docs/training/packed-sequences.md: three-path table (offline SFT, VLM, CP) - Note SP interaction with CP packing (pad_seq_to_mult = 2 * CP * TP) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test 75c9995 |
|
/claude review |
| For operational details on configuring combined parallelism, troubleshooting | ||
| layouts, and memory estimation, see the | ||
| [parallelism strategies skill](../skills/perf-techniques/parallelism-strategies.md). | ||
|
|
There was a problem hiding this comment.
Bug: This links to ../skills/perf-techniques/parallelism-strategies.md, but the actual file is skills/perf-techniques/parallelism-strategies/SKILL.md. The link will 404.
| [parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md). |
docs/parallelisms.md
Outdated
|
|
||
| - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification |
There was a problem hiding this comment.
Two broken links here:
../skills/perf-techniques/parallelism-strategies.mdshould be../skills/perf-techniques/parallelism-strategies/SKILL.md../knowledge/techniques/parallelism_strategies.yaml— this file doesn't exist in the repo and isn't added by this PR
|
|
||
| Verify the supported MPU path: | ||
|
|
||
| ```bash | ||
| CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \ |
There was a problem hiding this comment.
The verification section references scripts/verify_hybrid_context_parallel.py, but this script doesn't exist in the repo and isn't added by this PR. Either add the script or replace this with a verification method that uses existing files.
- Fix broken cross-references: update all stale *.md skill paths to the new */SKILL.md layout used throughout this PR - Add cuda-graphs.md and hybrid-context-parallel.md to docs/index.md toctree so they are no longer orphaned in the Sphinx build - Fix RNG tracker naming: use cfg.rng.te_rng_tracker consistently instead of mixing with use_te_rng_tracker in cuda-graphs.md - Add Local Checkpointing section to resiliency.md with When to Use It and Stable Constraints so the feature table entry is not incomplete - Add moe_shared_expert_overlap = False as a hard required constraint in expert-parallel-overlap/SKILL.md (it is asserted at runtime) - Replace hard-coded "41 passed" with "zero failures" wording in expert-parallel-overlap/SKILL.md so it does not drift - Add cfg.ddp.average_in_collective = False to FSDP example recipe to match the minimal override set documented above it - Update megatron-fsdp/card.yaml torch_fsdp2_runtime from unclear to known_failure to reflect the documented runtime TypeError - Replace non-existent scripts/verify_hybrid_context_parallel.py in hybrid-context-parallel/SKILL.md and card.yaml with existing test_decentralized_pg.py and a manual smoke-check recipe invocation - Add pad_cu_seqlens metadata prerequisite note in sequence-packing/SKILL.md (asserted in sft.py:765-772) - Fix cross-reference in parallelism-strategies/SKILL.md line 208 Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test 4f6cca7 |
All skill and cross-doc references in docs/training/ were plain backtick literals. Convert them to [text](relative-path) hyperlinks so they are clickable in GitHub and rendered Sphinx docs. Covers communication-overlap.md, cuda-graphs.md, hybrid-context-parallel.md, megatron-fsdp.md, packed-sequences.md, and resiliency.md. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test f543ff5 |
Summary
docs/training, operational guides inskills/perf-techniques, and machine-readable cards inknowledge/techniquesTest plan
pre-commit run --fileson the staged docs, skills, and knowledge filesMade with Cursor
Summary by CodeRabbit